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O Abstract. We investigate the utility to contemporary Bayesian studies of re- 

^ cursive, Gauss-Seidel-type pathways to marginal likelihood estimation char- 

^ acterized by reverse logistic regression and the density of states. Through a 

1—5 pair of illustrative, numerical examples (including mixture modeling of the 

QO well-known 'galaxy dataset') we highlight both the remarkable diversity of 

CN bridging schemes amenable to recursive normalization and the notable effi- 

ciency of the resulting pseudo- mixture densities for gauging prior-sensitivity 
in the model selection context. Our key theoretical contributions show the 
^ connection between the nested sampling identity and the density of states. 

Further, we introduce a novel heuristic ('thermodynamic integration via 
importance sampling') for qualifying the role of the bridging sequence in 
, marginal likelihood estimation. An efficient pseudo-mixture density scheme 

for harnessing the information content of otherwise discarded draws in 
^ ellipse-based nested sampling is also introduced. 

O Key words and phrases: Bayes factor, Bayesian model selection, importance 

^ sampling, marginal likelihood. Metropolis-coupled Markov Chain Monte 

^ Carlo, nested sampling, normalizing constant, path sampling, reverse lo- 

^ gistic regression, thermodynamic integration. 

o 

m 

7^. 1. INTRODUCTION 

• ^ Though often secondary to parameter inference in the Bayesian paradigm, the 

^ normaUzing constant, Z, required to estabUsh the posterior, TT{6\y), as a proper 

probabihty density, 

(1) TT {9\y) = 7T (9) L{y\9)/Z where Z = J 7r{9)L{y\9)d9, 

for prior, tt{9), and Hkelihood, L{y\9), nevertheless plays a vital role in the domain 
of Bayesian model selection and model averaging (Kass & Raftery, 1995; Hoeting 
et al., 1999). Here Z is generally referred to as either the marginal likelihood (i.e., 
the likelihood of the observed data marginalized [averaged] over the prior den- 
sity) or the model evidence. With the latter term though, one risks the impression 
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of over-stating the value of this statistic in the case of hmited prior knowledge 
(cf. Gelman et al. 2004, ch. 6). Problematically, few complex statistical problems 
admit an analytical solution to Equation 1 , nor span such low dimensional spaces 
{D{6) < 5-10) that direct numerical integration presents a viable alternative. 
With errors (at least in principle) independent of dimension, Monte Carlo-based 
integration methods have thus become the mode of choice for marginal likelihood 
estimation across a diverse range of scientific disciplines, from evolutionary biol- 
ogy (Xie et al., 2011; Arima & Tardella, 2012; Baele et al., 2012) and cosmology 
(Mukherjee et al., 2006; Kilbinger et al., 2010) to quantitative finance (Li et al., 
2011) and sociology (Caimo & Friel, 2012). 

1.1 Monte Carlo-Based Integration Methods 

With the bulk of posterior mass most often constrained within a far smaller 
volume than that of the prior, the simplest marginal likelihood estimators drawing 
solely from 7r{9\y) or tt{9) cannot be relied upon for model selection purposes. In 
particular, the harmonic mean estimator (HME; Newton & Raftery 1994), 



(2) Z 



-1 

^- J]l/n/L(y|0,)' 



for 0i~7r(^|y), 



suffers from having infinite variance under common modeling conditions, mean- 
ing that its convergence towards the true Z as a one-sided a-stable limit law can 
be incredibly slow (Wolpert & Schmidler, 2012). Even when 'robustified' as per 
Gelfand & Dey (1994) or Raftery et al. (2007), however, the HME remains no- 
tably insensitive to changes in it{9), whereas Z itself is characteristically sensitive 
(Robert & Wraith, 2009; Friel & Wyse, 2012). Though assuredly finite by default 
the small-sample variance of the prior arithmetic mean estimator (AME), 

n 

(3) Z^ = Y1 ^iy\^i)/^ for Oi ~ Tr{9), 

i=l 

on the other hand, remains notoriously large, with huge sample sizes again nec- 
essary to achieve reasonable accuracy (cf. Neal 1999). 

A wealth of more sophisticated integration methods have lately been developed 
for generating reliable estimates of the marginal likelihood (see Robert &: Wraith 
2009 and Friel & Wyse 2012 for recent reviews). These include: annealed impor- 
tance sampling (Neal, 2001), bridge sampling (Meng & Wong, 1996), [ordinary] 
importance sampling (cf. Liu 2001), path sampling/thermodynamic integration 
(Gelman & Meng, 1998; Lartillot & Philippe, 2006; Friel & Pettitt, 2008; Calder- 
head & Girolami, 2009), nested sampling (Skilling, 2006; Feroz & Hobson, 2008), 
nested importance sampling (Chopin &: Robert, 2010), reverse logistic regression 
(Geyer, 1994), sequential Monte Carlo (SMC; Cappe et al. 2004; Del Moral et 
al. 2006), and the density of states (Habeck, 2012; Tan et al., 2012). A common 
thread running through all these schemes is the aim for a superior exploration 
of the relevant model space via 'guided' transitions across a sequence of interme- 
diate distributions bridging their Tr{9) and TT{9\y) extremes. (Or, more generally, 
their h{9) and vr(0|y) extremes if a suitable auxiliary /reference density, h{9), 
is available to facilitate the integration; cf. Lefebvre et al. 2010.) However, the 
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exact bridging paths specified may be quite dissimilar. Nested sampHng, for in- 
stance, evolves its particles over a sequence of constrained-likelihood distributions, 
f{9) oc 7r{9)I{L{y\6) > Ln^), transitioning from the prior (Lum = 0) through to 
the (vicinity of the) peak likelihood (Lum ~ -Z^max — e); while thermodynamic in- 
tegration draws progressively (via Markov Chain Monte Carlo [MCMC]; Tierney 
1994) from the family of 'power posteriors', 

(4) 7rt{9\y) « 7r{9)L{y\e)\ 

most explicitly connecting prior, 7r(^) [t = 0], to posterior, vr(0|y) = 1]. 

Another key point of comparison between these rival Monte Carlo techniques 
lies in their choice of identity by which the evidence is ultimately computed. The 
[geometric] path sampling identity, 

(5) logZ= [' E^,{logL{y\9)}dt, 

Jo 

for instance, is shared across both thermodynamic integration and SMC, in ad- 
dition to its namesake. However, SMC can also be run with the "stepping-stone" 
solution (cf. Xie et al. 2011), 

m 

(6) Z = where t(l) = and t(m) = 1, 

i=2 

with {t{j) : J = 1, . . . , m} indexing an often parametric sequence of ("tempered") 
bridging densities (either pre-defined or drawn stochastically from some vr(t); 
Gelman Sz Meng 1998), and indeed this is the mode prefered by experienced 
practitioners (e.g. Del Moral et al. 2006). Yet another identity for computing 
the marginal likelihood is that of the recursive pathway, characterised by reverse 
logistic regression (RLR) and the density of states (DOS). 

By recursive we mean that, algorithmically, the estimator may be run such that 
the desired Z is obtained through backwards induction of the complete sequence 
of intermediate normalizing constants, ■^t(j), corresponding to the m indexed 
bridging densities by supposing these Z^i^j^ to be already known. That is, a stable 
solution may be found in a Gauss-Seidel-type manner by starting with a guess of 
each normalizing constant as input to a convex system of equations for updat- 
ing these guesses, returning the new output as input to the same equations, and 
iterating until convergence. In fact, although the RLR and the DOS approaches 
differ vastly in concept and derivation — the former emerging from considerations 
of the reweighting mixtures problem in applied statistics (Geyer & Thompson, 
1992; Geyer, 1994; Chen & Shao, 1997; Kong et al., 2003) and the latter from 
computational strategies for free energy estimation in physics/chemistry/biology 
(Ferrenberg &: Swendsen, 1989; Kumar et al., 1992; Shirts &: Chodera, 2008; 
Habeck, 2012; Tan et al., 2012) — both may be seen to recover the same algorith- 
mic form in practice. To illustrate this equivalence, and to explain further the 
recursive pathway to marginal likelihood estimation, we describe each in detail 
below. 

1.2 Reverse Logistic Regression 

In the reweighting mixtures problem (cf. Geyer & Thompson 1992 and Geyer 
1994) the aim is to discover an efficient proposal density for use in the impor- 
tance sampling of an arbitrary target about which little is known a priori. Geyer's 
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solution was to suggest sampling not from a single density of standard form, 
but rather from an ensemble of different densities, qt(j)iO) = ft{j){^)/Zt(j) for 
j = 1, . . . , m for completely known ft(j){0) and typically unknown — as may 
be achieved, for instance, via Metropolis-coupled MCMC (MC^; e.g. Geyer 1992). 
The pooled draws, {9^^ : i = 1, . . . , n(j); j = 1, . . . , m}, are then to be treated 
as if from a single pseudo- mixture density, with each free normalizing constant, 
and hence the appropriate weighting scheme, to be derived (up to a constant of 
proportionality) via RLR. For our purposes of marginal likelihood estimation, 
we will suppose herein that the normalization is known absolutely for at least 
one auxiliary/reference density in the sequence (assigned t{l) = without loss 
of generality), such that qo{6) = h{6) [Zq = 1]. If t{m) = 1 indexes the Bayesian 
posterior, i.e., qi{0) = Tr{9)L{y\6)/Z \Z\ = Z], then the desired marginal likeli- 
hood is rendered directly from the RLR solution for this maximal temperature 
(Equation 11, as described below), otherwise its recovery will require one further 
(but trivial) importance sampling step (Equation 16). Following Geyer (1994), 
we write our pseudo-mixture density, p{9), in the form, 

m 

(7) p(e) = 5^[n(s)/n][/i(,)(0)/Z,(,)], 

s=l 

where n(s) represents the sample size at each temperature and n = YlT=i '^(s)- 

Before continuing it is important to acknowledge one particular caveat on 
the construction oi p{9). Namely, that if labelling were available, the observed 
proportions in a multinomial draw from this ensemble would be unlikely to match 
their design constants, n{s)/n, though we proceed as if they would; hence, our 
stress of the qualifier, pseudo-mixture density. 

With this simplification the [quasi-] log-likelihood of every being drawn 
from its true q^^j-^ (9) becomes 

(8) logL({# :i = l,...,n(i); j = 1, . . . ,m}\{Zt^^), . . . , Z^^^)}) = 

m n{j) 

EEi°g(/*a)(^?)/%) /P(0?)). 
j=i i=i 

Setting the partial derivative in each unknown Z^^^) (k = 2, . . . , m) to zero yields 
the series of convex equations defining the RLR estimator as follows: 

(9) d/dZt^k) logL({0p) : i = 1, . . . ,n(j); j = l,.. . ,m}|{Zt(i), . . . , Zt(^)}) = 

n{k) m n(j) / m \ 

-^1/Z,(,) [-n(fc)/t(fc)(#)/z2,)]/[j;n(s)/,(,)(0pV^*{s)] 

1=1 j=l i=l \ s=l J 

m n{j) / m \ 

(10) ^ = EE Aw(^?)/E^(«)/*w(^?^)/^*w] 

j = l 1=1 \ s = l J 



(11) ^ ^t(fc) = E fmio^)/[J2<^)Msm/Ztis) 

i=l \ s=l 
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The point of explicitly writing out the last step here is to highlight that having 
derived the 'labelled' estimator given by Equation 10 we can now 'lose the labels', 
(j), on our draws without loss of accuracy. As Kong et al. (2003) explains, 
"under the model as specified ... the association of draws with distribution labels 
is uninformative. The reason for this is that all the information in the labels for 
estimating the ratios is contained in the design constants, {f^(l), . . . ,n(m)}". 

To solve numerically for the full set of unknown Z^(fc) {k = 2, . . . , m) one may 
simply proceed via the Gauss-Seidel type recursive pathway noted earlier; that is, 
start with a guess of each as input to the convex system for updating these 
guesses (defined by Equation 11), return the new output as input to the same, and 
iterate until convergence. Geyer (1994) credits Alun Thomas for suggesting this 
approach in the RLR context, while Kong et al. (2003) points to Vardi (1985) for 
a more general theoretical justification. Key asymptotic convergence proofs with 
necessary conditions for the uniqueness of the RLR solution are given by Geyer 
(1994), Chen & Shao (1997), and Kong et al. (2003). Perhaps the most important 
of the latter to stress is that for identifiability. Defining a separable sample as 
one for which there exist disjoint subsets P and Q of {1, . . . ,m} such that for 
each 6i and each p £ P and q £ Q either ft(p){Oi) = or ft(q){Oi) = 0, Geyer 
(1994) demonstrates that the solutions to Equation 11 are unique if and only if 
the collection of pooled draws, {6i : z = 1, . . . , n}, is strictly inseparable. (In fact, 
Geyer states uniqueness up to an additive constant, but we may drop this qualifier 
for the present case where Zq = 1 has been assumed.) The design of the bridging 
sequence so as to ensure, or at least render highly likely, the inseparability of the 
drawn Oi is in effect the only constraint on our choice of ensemble — in particular, 
the adopted qt{k){S) need not match the support of 7r(6'|y) individually, though 
their union must (Kong et al., 2003). 

The gradient and Hessian of the RLR [quasi-] log-likelihood are also available 
analytically as 

n 

(12) aiogL/aiogZ,(fe) = ^(/{0 G %(fc)(^)} - 1), and 

i=l 

n 

(13) a2iogL/5iogZi(fc)2 = Y,^{e£ qtik){om{e g %(fc)(e)} - 1), 

1=1 

n 

(14) a2logL/aiogZi(,)aiogZi(,) =Y.l{9£ qtik){0)]l{e e qtis){0)}, 

i=l 

respectively, where l{e G qt{k){9)} = ft{k){0)[n{s)/n]/Zti^k)/p{9)- Aside from of- 
fering a non-recursive pathway to Z via downhill gradient search from a starting 
position sufficiently near to the true .^t(fc) (as implemented by Tan et al. 2012, 
for example, using a trust region algorithm in R), the above equations also facili- 
tate an estimate of the uncertainty associated with the recovered log .^t(fc) . Geyer 
(1994) gives a generic limiting form for this uncertainty in his asymptotic Nor- 
mality theorem for the RLR, while Kong et al. (2003) and Tan et al. (2012) offer 
more practical matrix constructions (the latter designed to avoid the burden of 
generalized matrix inversion). 

A 'naive' alternative derivation of the RLR estimator relevant to the thermo- 
dynamic integration via importance sampling methodology we describe in Section 



6 



CAMERON & PETTITT 



2 is that given by Jiang &: Tanner (2003) in their discussion of Kong et al. (2003): 
simply take p{9) as a pseudo-importance samphng density for each qt[k){d), such 
that 

(15) = / ftik){e)/p{d){p{e)de}, 

Jn(e) 

and solve recursively, 

n 

(16) ii(fc) = ifmm/p{ei)/n) for 9i ~ p{e). 

i=l 

The finite variance of the RLR estimator under this construction may be seen as 
a consequence of its implicit use of defensive importance sampling (Hesterberg, 
1991) whereby every target density is itself contained within the proposal density. 
The (original) stabilized harmonic mean estimator (SHME; p^) of Newton Sz 
Raftery (1994) is in fact based on exactly the approach described by Jiang & 
Tanner, but with a further step of normalization by the sum of weights (serving 
only to speed up its iterative convergence). If 7r{0\y) is not expressly contained 
within the proposed qt(^k) (^) family, then as mentioned earlier a final importance 
sampling step will be needed to recover the desired Z — for this the relevant 
formula is given by Equation 16 with ft[k){^i) replaced by ■K{6i)L{y\9i). 

1.3 The Density of States 

Yet another construction of the convex series of .^((/t) updates characterizing 
the RLR appoach (cf. Equation 11) has recently been demonstrated in the context 
of free energy estimation for molecular interactions by Habeck (2012) and Tan et 
al. (2012). In this alternative method rather than aiming directly for estimation of 
the marginal likelihood one aims instead to reconstruct a closely-related measure, 
the density of states (DOS), ^(e), "defined" in terms of a composition of the Dirac 
delta 'function', 5{-), as 

(17) g{e) = 1 7r{9)d{e + log L{y\e))d9. 

Our somewhat uncharitable placement of quotation marks in the above is in- 
tended to draw attention to a crucial point overlooked in these previous studies; 
namely, that the composition of the Dirac delta 'function' — which is itself not 
strictly a function, being definable only as a measure or a generalized function — 
lacks an intrinsic definition. Hormander (1983) proposes a definition in M" valid 
only when the composing function, here v{9) = £^-|-log L{y\9), is continuously dif- 
ferentiable and dv{9)/d9 nowhere zero; the latter condition in particular must be 
considered problematic in the general case that L[y\9) features a global maximum 
on ^{9), and disasterous in the specific case that L{y\9) features a constant set of 
non-zero measure with respect to it{9)\ Here we suggest an alternative definition 
of the DOS as the derivative of an inverse survival function, which elucidates its 
connection to nested sampling. 

Chopin &: Robert (2010) characterize nested sampling by supposing first the 
invertibility of the survival function, (j)~^{l), where 



(18) 



r\l)=pr{L{y\9)>l}(oT9r^7r{9), 
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which they ensure by stipulating L{y\6) continuous and the support of 7r(0) con- 
nected. With 4'{0) and so defined the nested sampling identity may be 
written as 

(19) ^= f (P{x)dx. 

Jo 

Now, consider the distribution function defined in terms of the free energy, E{6) = 
— logL(y|6') (i.e., e = — log?), such that 

(20) G{e) = pt{E{9) < e} for ~ 7r{9), 

from which the DOS may be defined for differentiable G(e) (again, L(^y\9) can 
have no set of constant values of non-zero measure) as 

(21) g{e) = dG{e)/de. 
Since G(e) = (/)~^{exjp[—e]) we have 

(22) g{e) = d^-^{exp[-e])/de = d<j)-^{l)/dl x dl/de = dcp'^{l)/dl x -exp[-e]. 

Transforming the nested sampling identity of Equation 19 by substitution with 
yields 

(23) Z = I (l){x)dx= / X d(l)-\l)/dl X dl, 
and hence, 

/•O /-oo 

(24) Z = I I X —g{e) exp[e] x dl = exp[— e] x —g{e) exp[e] x dl/de x de. 

J oo J — oo 

That is, consistent with the requirements of Habeck (2012) and Tan et al. (2012), 
our DOS formulation returns the identity, 

/oo 
g{e) exp[—e]de. 
-oo 

To make use of this identity Habeck (2012) suggests sampling from a series 
of bridging densities indexed by t{k) as in the RLR method but with each fea- 
turing an explicit dependence on the free energy, — log L{y\9), such that qt(k){9) = 

[here we write f*{E{9)) to distinguish such specifically energy- 
dependent bridging sequences from their more general f{9) counterparts in "or- 
dinary" RLR]. Accordingly each Zj(fc) may be written in terms of the DOS as 

(26) Z,(,)= / f;^^,^{E{9))^{9)d9 = r g{e)f:^,^{e)de. 

JQ{9) J —oo 

Treating the pool of energies, {Ei = E{9^p) : i = 1, . . . , n{j);j = 1, . . . , m}, cor- 
responding to the pool of 9i draws, as a single simulation of length, n, from the 
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pseudo-mixture, p{e) = YlT=i[''^i^)/''^][9i^)ft{s)(^)/^t{s)]j one may construct di- 
rectly the recursive importance sampling estimator of each normalizing constant, 

n 

(27) =J29iE^)f:(^k)iE^)/p{E^)/n 

i=l 

n / m \ 

= Y,^9{E^)f:(^k)iE^)/[Y.[n{s)g{E,)f:^^^{E,)/Z,(^,)] 

i=l \ s=l / 

n / m \ 

i=l \ s=l ) 

Surprisingly, despite having based our derivation of ^t(fc) on the DOS we never 
need know or even directly estimate g{e) as it cancels out in the final step! 
Inspection of Equation 27 confirms that the DOS solution for Z^(fc) (fc = 2, . . . , m) 
matches exactly that of the RLR with ft{k){(^) = ft{k)^E{9))TT{9). 

As Habeck (2012) has insightfully highlighted, despite the restriction that the 
bridging distributions of the DOS take the form qt(k)iG) = ft{^k)^E{9))-K{6)/Z^^j^-^, 
the sampling schemes of both thermodynamic integration along the power pos- 
teriors path and nested sampling can be easily accommodated within this frame- 
work given astute choices for f*^j^-^{E{9)) — namely, f*^j^-^{E{9)) = exp[—t{k)E{9)] 
for the former and f*^^^{E{9)) = H{E^f^'j — E{9)) with H{-) the Heaviside step 

function and = E{9^ ^■*) for the latter. In principle one could even pool 

draws from both schemes together under the DOS framework to construct a joint 
power posteriors-nested sampling approximation to the marginal likelihood — 
with the sub-sample estimates of Z based on each separately providing a check 
on the computations leading to their combined solution; though the computa- 
tional complexity of such a scheme is unlikely to be attractive to many users. 
Finally, it is perhaps also important to note here that our definition of G{e) as 
a probability with respect to 'k{9) may of course be relaxed to become a proba- 
bility with respect to some generic reference distribution, h{9), [assumed closer 
to 7r(0|y) than 7r(0)] as a means of improving the efficiency of such DOS-based 
integration. 

1.4 Prior-Sensitivity Analysis 

In the Bayesian framework (Jeffreys, 1961; Jaynes, 2003) the ratio of marginal 
likelihoods under rival hypotheses (i.e., the Bayes factor) operates directly on the 
prior odds ratio for model selection, as 

(28) Ti{Mi\y)/Ti{M2\y) = [7r(Mi)/^(M2)] [^(y|Mi)/^(y|M2)] 

= Zm(1)/Zm{2)T^{Mi)/tt{M2). 

A much maligned feature of the marginal likelihood in this context is its possible 
sensitivity to the choice of the parameter priors, 'k{9\Mi) and 'k{9\M2). When 
there is limited information or theoretical motivation available to inform this 
choice the resulting Bayes factor can appear arbitrary in value. (On the other 
hand, viewed as a quantitative implementation of Ockham's Razor, the key role 
of prior precision may well serve as strong justification for the use of Bayesian 
model selection in the scientific context; cf. Jeffreys &: Berger 1991.) In their 
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influential treatise on this topic Kass &: Raftery (1995) thus argue that some 
form of prior-sensitivity analysis be conducted as a routine part of all Bayesian 
model choice experiments; their default recommendation being the recomputation 
of the Bayes factor under a doubling and halving of key hyperparameters. 

If the initial marginal likelihoods have been computed under an amenable sam- 
pling scheme then, as Chopin &: Robert (2010) point out for the case of nested 
importance sampling, alternative Bayes factors under (moderate) prior rescalings 
may be easily recovered by appropriately reweighting the existing draws, without 
the need to incur further (computationally expensive) likelihood function calls; 
and indeed the RLR method was developed specifically to facilitate such compu- 
tations (though in the reweighting mixtures context; Geyer & Thompson 1992; 
Geyer 1994). That is, having recovered estimates for each .Z^(fc) under our nominal 
prior the pooled draw pseudo-mixture density, p{9) = YlT=i[''^i^) / ^]lt(k){(^) ^ ^ow 
offers by design an efficient proposal for importance sampling of many other tar- 
gets near the posterior. The alternative marginal likelihood estimate, Z^xt, under 
alternative prior, 7rait(^), simply becomes 

n 

(29) 4it = Yl Hy\GiM0i)/pi9i)/n. 

1=1 

The stability of this importance sampling procedure may then be monitored via 
the effective sample size, ESS = n/[l + varp{7rait(^)/p(6')}], following Kong et al. 
(1994). 

2. THERMODYNAMIC INTEGRATION VIA IMPORTANCE SAMPLING 

Inspired by the recursive pathway (of the RLR and DOS) we present here 
yet another such strategy for marginal likelihood estimation, which we name, 
'thermodynamic integration via importance sampling' (TIVIS). Although quite 
novel at face value it is easily shown to be yet another manifestation of the RLR 
methodology; yet by effectively recasting the RLR as a thermodynamic integra- 
tion procedure we attain insight into the relationship between its error budget 
and the choice of bridging sequence. Specifically, the error in the estimation of 
each Z^(fc) may be thought of as dependent on both the J-divergence (Lefebvre et 
al., 2010) between it and the remainder of the ensemble (via the thermodynamic 
identity) and on the accuracy of our estimates for those other {j ^ k). Thus, 
we suggest that an effective choice of bridging sequence will produce a near equal 
spacing of logZ^(fc) (a proposal we explore numerically in Section 3 below). 

To construct the TIVIS estimator we once again assume the availability of 
pooled draws, {^p'' -.1 = 1,..., n{j); j = 1, . . . , m}, from a sequence of bridging 
densities, qt(j){0) = ft(j){0) / Zt(^j) (j = 1, . . . , m), with each ft(k){0) known exactly. 
Moreover, we suppose that t(l) = indexes a normalized reference/auxiliary, 
7r(0) or h{9), such that Zi = 1 is known, but with the remaining typically 
unknown. Despite our subsequent use of the thermodynamic identity, however, 
we do not require here that the bridging densities follow the geometric path be- 
tween these two extremes. Now, rather than seek each Z^^/j) via direct importance 
sampling from p{0) as per the RLR, the TIVIS method is to instead seek each 
normalization constant via thermodynamic integration from its preceding density 
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in the ensemble, qt{k~i){^)-, using the identity, 

(30) logV-)= f K>^{^og{f,^k){0)/qtik-i){0))}dx, 

Jo 

where 7r^(e) oc oc [ftik)iO)nMk-i)iO)?-'' ■ (Here we have 

adopted the unconventional notation, x, for the thermodynamic integration vari- 
able to avoid confusion with the t{k) of the bridging densities.) For existence of 
the log-ratio in Equation 30 we must impose the strict condition (not necessary 
for ordinary RLR) that all qt(k){0) share matching supports. Importance sampling 
from p{6) allows construction of the appropriate (but unnormalized) weighting 
function, 

(31) w{9,x) = [ftik){o)r[ftik-im'-vp{o), 

which in substitution to Equation 30 yields the TIVIS estimator, 
(32) 

/in n 
E log {fm{Oi)/ftik-i){Oi)) w{ei,x)]/[Y, wiOi, x)]dx. 

- i = l 2 = 1 

In computational terms, numerical solution of the one-dimension integral in the 
above may be achieved to arbitrary accuracy by simply evaluating the integrand 
at sufficiently many xj on the unit interval, followed by summation with Simp- 
son's rule. If the sequence of bridging densities is well-chosen (and suitably or- 
dered) the J-divergence between each qt(k)i0) &ud qt[k-i){^) pairing should be 
far less than that between prior and posterior, such that a naive regular spacing 
of the Xj will suffice. 

To show the equivalence between this estimator and that of the RLR defined 
by Equation 11 we simply observe that the derivative of the denominator in 
Equation 32 equals the numerator, and thus by analogy to u' {x) / u{x)dx = 
logu(l) — logn(O) we have 



iog(4(fc)/4(fc-i)) = \og[Y,fm{.Oi)/vm - \og[Y,ft[k-i){Oi)/pm 

i=l i=l 
n n 

^ log(4(fc)/Zj(,_i)) = \og[Y,fm{Oi)/p{ei)/n\ - \og[Y,ft{k-l){0^)|p{ei)/n\ 

i=l i=l 

n 

(33) ^ log4(fc) = log[E !m{Qi)im)ln\. 

i=\ 

In the following two case studies we explore by numerical example both the 
diversity of computational implementions of the recursive pathway (with partic- 
ular reference to their efficiency in L(%j\Q^ calls; Section 3) and the utility of these 
RLR- normalized bridging sequences for testing prior-sensitivity (Section 4). 

3. CASE STUDY: BANANA-SHAPED PSEUDO-LIKELIHOOD 

FUNCTION 



For our first case study we demonstrate application of the recursive pathway 
to estimation of the marginal likelihood for a banana-shaped pseudo-likelihood 
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function [hence, we write here L{9) instead of L{y\9)] in two-dimensions, defined 

as 

(34) L{e = {01, 02}') = exp[-(10 x (0.45 - ei)f/A - (20 x {62/2 - et))% 

with a Uniform prior density of tt{0) = 1/4 on the rectangular domain, [—0.5, 1.5] x 
[—0.5,1.5]. A simple illustration of this L(9) as a (logarithmically-spaced) con- 
tour plot is presented in the left-hand panel of Figure 1. Brute- force numeri- 
cal integration via quadrature returns the "exact" solution, Z = 0.01569[6] (or 
logZ = -4.154[3]). 



Banana-Shaped RLR/DOS/TIVIS 




t(k) ={0,0.015,0.125,0.42,1} 



I 1 1 1 1 _5 L I ! ! ! 1 

-0.5 0.0 0.5 1.0 1.5 10^ 3x10^ 10"* 3x10" 10^ 

01 n 



Fig 1. The banana- shaped pseudo-likelihood function of our first case study (Equation 34 of 
Section 3) is illustrated graphically here (left-hand panel) as a (logarithmically-spaced) con- 
tour plot on the domain of our Uniform prior, [—0.5,1.5] x [—0.5,1.5]. Convergence of the 
RLR/DOS/TIVIS estimator for the corresponding marginal likelihood under Metropolis-coupled 
MCMC sampling of the power posterior (at five pre-specified temperatures) as a function of the 
total sample size is shown in the right-hand panel. The marked points and error bars on this 
Figure indicate respectively the recovered mean and standard error in log Z for thirty trials at 
each n. The dash-dotted, yellow line indicates the "exact" \ogZ for this example derived via 
brute-force quadrature, and the dotted, red lines highlight the ^l/y^n convergence rate. 



As a benchmark of the method we first apply the RLR/DOS/TIVIS estima- 
tor to samples drawn from a sequence of bridging densities following the power 
posteriors path. Though even a cursory inspection of the pseudo-likelihood func- 
tion for this simple case study is sufficient to confirm its unimodality and to 
motivate a family of suitable proposal densities for straight-forward importance 
sampling of '/rt(0) cx 7r(^)L(0)*, for illustrative purposes we have chosen to imple- 
ment an yiC^ (Geyer, 1994) approach here instead; the latter being ultimately 
amenable to a much wider variety of Bayesian analysis problems than the for- 
mer. With regard to the choice of tempering schedule, for which we first trial a 
thermodynamic integration-inspired scaling, we refer to Gelman &: Ivleng (1998), 
who note pragmatically that a priori estimation of the optimal 7r(t) will often 
be far more difficult than the estimation of Z itself; thus we simply follow Friel 
& Pettitt (2008) in adopting a pre-determined set of (five) temperatures spaced 
geometrically as t = {(0, 1/4, 1/2, 3/4, l)^}. 

To quantify the behavior of the RLR/DOS/TIVS estimator under this power 
posteriors sampling strategy we have run the above computational experiment 
thirty times at each of five total sample sizes spaced as n = {10'^, 3 x 10^, 10^, 3 x 
10^, 10^}; the results of which are illustrated in the right-hand panel of Figure 1. 
Code for reproducing our experiment using the RLR/DOS and TIVIS schemes 
alternately (cf. Sections 1.2/1.3 and 2) is available from the first author upon 
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request. As demonstrated by Geyer (1994), whether for MetropoUs-coupled or 
regular MCMC samphng, the RLR estimator wih converge asymptotically to the 
true log Z with a standard error decreasing as 1/ yjn\ and indeed we can confirm 
that such a scaling with n is already evident in the standard errors recovered from 
the present example (see Figure 1). More importantly, we can also confirm a close 
agreement between this mean standard error (computed from repeat simulation) 
and the estimated asymptotic standard deviation of logZ computed using the 
matrix form from Kong et al. (2003) [their Equation 4.2], highlighting the utility 
of the latter for efficient uncertainty estimation. 

Finally, we can use the derived sequence of logZj(;j) to check the suitability 
of our adopted tempering schedule for the present case study; with the expecta- 
tion according to our TIVIS formulation that an efficient pathway will produce 
a near equal spacing. In this example for i = {(0,1/4,1/2,3/4,1)'^} we recover 
logZt^k) ~ {0,-0.9, -2.3,-3.3,-4.2}— i.e., a spacing of {-0.9, -1.4, -1, -0.9} 
with sample standard deviation, (^logZy, ~ ^'^^ — giving a standard deviation 
of logZ about the true logZ of = 0.10 at n = 10,000. By way of com- 
parison, for alternative temperature exponents, {1,2,4,5}, we recover o'^^gZ]^ = 
{1.1, 0.44, 0.38, 0.55} with = {0.19, 0.16, 0.12, 0.13}— confirming that our ther- 
modynamic integration-inspired, aj^^g^i^'^^i^i^i^S choice of temperature sequence 
was indeed most efficient. 

With this power posteriors version of RLR as benchmark we now consider the 
merits of two alternative schemes for defining, and sampling from, the required 
sequence of bridging densities, qt{k)iP)i i^i Sections 3.1 and 3.2 below. 

3.1 Thermodynamic Integration from a Reference/Auxiliary Density 

As highlighted by Lefebvre et al. (2010), the error budget of thermodynamic 
integration over the geometric path depends to first-order upon the J-divergence 
between the reference/auxiliary density, h{9), and the target, 7r(0|y). Thus, it 
will generally be more efficient to set a 'data-driven' h{9) — such as may be re- 
covered from the position and local curvature of the posterior mode — than to 
integrate 'naively' from the prior, i.e., h{9) = tt{9). Here we demonstrate the 
corresponding improvement to the performance of the RLR/DOS/TIVIS estima- 
tor resulting from the relevant choices, h{9) ~ Trrunc. (/^modei 5]"^^^^,) and h{9) ~ 
A/rruncX/^mode, ^^mode)" ^^^^ Trrunc. and VVxrunc. denote the two-dimensional Stu- 
dent's t {v = 1) and Normal distributions (truncated to our prior support), 
respectively, while //mode denotes the posterior mode and Smode its local curva- 
ture (recovered here analytically, but estimable at minimal cost in many Bayesian 
analysis problems via standard numerical methods). As before we apply MC^ to 
explore the tempered posterior and run both experiments thirty times at each 
of our five n. In contrast to the power posteriors case we adopt here a regular 
temperature grid, t = {0,0.25,0.5,0.75,1}, to allow for the imposed/intended 
similiarity between n{9\y) and h{9). Our results are presented in Figure 2 and 
discussed below. 

As expected from both theoretical considerations (Gelman &: Meng, 1998; 
Lefebvre et al., 2010), and reports of practical experience with other marginal 
likelihood estimators (Fan et al., 2012), use of a 'data-driven' auxiliary in this ex- 
ample has indeed reduced markedly the standard error of the RLR/DOS/TIVIS 
scheme (at fixed n) with respect to that of the naive (power posteriors) path. 
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Fig 2. Convergence of the RLR/DOS/TIVIS estimator for the marginal likelihood of our banana- 
shaped pseudo-likelihood function under Metropolis-coupled MCMC sampling ( at five pre-specified 
temperatures) on the geometric path between a 'data-driven' reference/auxiliary density, h{6), 
and the posterior, shown as a function of the total sample size. The adopted h{6) takes a two- 
dimensional Student's t form in the left panel and a Normal form in the right, with its controlling 
parameters f/imodc and EmodeJ in each case set to the location and curvature of the posterior 
mode. A marked reduction in standard error (at fixed n) with respect to that of the naive (power 
posteriors) path, i.e., h{6) — n{9), is evident from comparison with Figure 1. 



i.e., h{6) = tt{6). Moreover, in this instance the (thinner-tailed) Normal aux- 
iliary form has out-performed the (fatter-tailed) Student's t (one d.o.f.), which 
is again consistent with theoretical expectations as a quick computation using 
the "exact" \ogZ confirms J[A/'(/imodc, S~J,j^), /i(0)] < J[T(^modo, S^odc)' ^(^)]- 
(Unfortunately, no such 'precise' optimisation of this form for h{9) is possible 
a priori without knowledge of the desired Z; and even a crude estimator of the 
J-divergence run with, e.g., the Laplace approximation to the marginal likelihood 
will nevertheless add numerous extra likelihood evaluations to the computational 
budget.) 

3.2 Ellipse/Ellipsoid-Based Nested Sampling 

Recalling our derivation of the DOS pathway to the marginal likelihood, which 
began with the fundamental identity of the rival nested sampling algorithm (see 
Section 1.3 and Equation 19 above), it is of some interest to compare directly the 
performance of these two particular methodologies. The present case study with 
its Uniform prior density is in fact well suited to this purpose since in the field of 
cosmological model selection, where nested sampling has been most extensively- 
used of late (Mukherjee et al., 2006; Feroz & Hobson, 2008), it is standard practice 
to adopt separable priors from which a Uniform sample space may be easily 
constructed under the simple quantile function transformation; which, for the 
discussion below, we assume has been done such that t^{9) may be taken as strictly 
Uniform on [0, 1]^. Given these conditions Mukherjee et al. (2006) outline a 
crude-but-effective scheme for exploring the constrained-likelihood space of nested 
sampling, in which the new "live" particle for each update must be drawn with 
density proportional to t:{6)I{L{6) > L{9i-i)). 

Under the Mukherjee et al. (2006) scheme, to draw the required Oi one sim- 
ply: (i) identifies the minimum bounding ellipse (or with D{9) > 2, the min- 
imum bounding ellipsoid) for the present set of "live" particles; (ii) expands 
this ellipse by a small factor ~ 1.5-2 with the aim of enclosing the full support 
of I{L{9) > L(0j_i)); and (ill) draws randomly from its interior until a valid 
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{9i,L{6i)} is discovered. Supposing the elliptical sampling window thus defined 
has been enlarged sufficiently to fully enclose the desired likelihood surface (which 
it must do to ensure unbiased sampling of {6i, L{6i)}, although we can rarely be 
sure that it has) it remains unlikely to match its shape exactly, leading to an 
overhead of discarded draws, {0^'* : L{9\^'^) < L{9i-i), j = 1, . . . ,noh}- At 
each Oi the incurred n^h may be thought of as a single realization of the nega- 
tive binomial distribution with p equal to the fraction of the bounded ellipse for 
which L[0) < L{6i-i); hence, S(noh) = 1- The magnitude of this overhead 

can in general be expected to scale with the geometric volume of the parameter 
space, potentially limiting the utility of this otherwise dimensionally-insensitive 
Monte Carlo-based estimator. However, where applicable the Mukherjee et al. 
(2006) scheme may well prove more efficient than the alternative of constrained- 
MCMC-sampling (cf. Friel & Wyse 2012) in which one must discard at least 
~10-20 burn-in moves (each with a necessary L{9) call) per accepted 6i to achieve 
approximate stationarity. 

Applying this ellipse-based approach to nested sampling of the pseudo- likelihood 
function of Equation 34 (with n/7 live particles in each case and a small extrap- 
olation of the mean Liive times exp(— 7) at the final step; cf. Skilling 2006) we 
recover a convergence to the true log Z (as shown in Figure 3) of efficiency (in 
n) comparable to that of the RLR/DOS/TIVIS estimator run over the geometric 
path with Student's t auxiliary (Figure 2, left-hand panel). However, the over- 
head of Hoh ~ 3.4 total likelihood calls observed here on average per accepted 6i 
should be a concern for applications of nested sampling in which the likelihood 
function may be genuinely expensive to evaluate; indeed for modern cosmolog- 
ical simulations MCMC exploration of the D{9) < 12 posterior is effectively a 
super-computer-only exercise due solely to the cost of solving for L{y\9). [At this 
point the skeptical reader might suggest that the distinctly non-elliptical L{9) 
considered in this example be considered a particularly "unfair" case for testing 
the Mukherjee et al. (2006) method, but such banana-shaped likelihoods are in 
fact quite common in higher-order cosmological models; see, for instance, Davis 
et al. 2007.] We therefore suggest that in general one might improve upon the 
efficiency of ellipse-based nested sampling by co-opting its bridging sequence into 
the RLR framework under 'soft' sampling of the likelihood constraint, and in the 
following we give one such specific example. 

By 'soft' sampling we simply mean sampling not subject to hard likelihood 
thresholding; the simplest version of which would be to follow the original nested 
sampling path, drawing as usual from each proposal ellipse until a suitable re- 
placement point is discovered but without discarding the lower likelihood draws 
at each step. The resulting qt(k){G) ensemble for k > 1 then takes the form, 
qt(k){0) (X I{9 G S/Z[Sii^e(fc)]), instead of %(fc)(0) (x I{9 G Ell[Ex^^^^k)]) x I{L{9i) > 
L{9i-i)), where ii^//[£'iive(fc)] denotes the minimum bounding ellipse/ellipsoid for 
the current set of live particles. Computational experiments on the present case 
study for N = 142 live particles (n ~ 1000) confirm the superior accuracy of 
this approach over direct nested sampling; with a mean and standard error from 
repeat simulation of logZ = —4.15 it 0.09 for the former, and —4.13 it 0.15 for 
the latter. However, with each nested sampling step contributing its own qt[k){(^) 
the computational burden of solving for the full set of Zk^j.-^ under RLR quickly 
becomes prohibitive at larger n. 
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Fig 3. The performance of nested sampling (left panel) as a marginal likelihood estimator for our 
banana-shaped pseudo-likelihood function, run under the ellipse-based strategy for exploring the 
sequence of constrained-likelihood densities proposed by Mukherjee et al. (2006); compared with 
that of the RLR/DOS (right panel) for our alternative scheme of 'soft' ellipse-based sampling 
(see Section 3.2). The two schemes converge to the true logZ at a similar rate in n (the number 
of draws from intermediate distributions compromising the final sample), but the former incurs 
a factor of ^3.4 more likelihood function calls in this case, owing to the often imprecise match 
between the proposal ellipse and the true constrained likelihood surface. This overhead of ellipse- 
based nested sampling can be expected to grow with the geometric volume; though for problems 
amenable to this technique (i.e., with separable priors) will likely remain preferable to the M CMC- 
based alternative (cf. Friel & Wyse 2012). 

Therefore, we propose the alternative strategy of moving up in fixed quan- 
tiles of the sampled likelihoods. For instance, (i) begin with n(l) draws from 
9t(fc=i)=o(^) = T^iS) [with t:{6) the unit hypercube under the quantile trans- 
formation], (ii) take a further n(2) draws via simple rejection sampling from 
Qt{k=2)iP) oc I{9 € Ell[Ei(^j._i-^]) where Ell[Ei(^j._i^\ now denotes the minimum 
bounding ellipse/ellipsoid for the upper third quantile of the previous £'j(^_]^) 
draws, and (iii) repeat (ii) a further m — 2 times. Much like nested sampling 
itself, for unimodal likelihoods this scheme will clearly act to evolve a series of 
nested ellipses in from the prior to the vicinity of the posterior mode; with mod- 
ification for the multi- modal case achievable via, e.g., fe-means clustering, as per 
Feroz &; Hobson (2008) 's implementation of the Mukherjee et al. (2006) scheme. 
Running this algorithm for ten steps with equal n{k) we are able to recover the 
marginal likelihood for our banana-shaped pseudo-likelihood function with very 
similar efficiency in n to that of nested sampling (as shown in Figure 3) — and 
thus (~3 times) superior efficiency in the number of total likelihood calls. 

Note that even if the prior were not amenable to the quantile transformation 
one could still apply our ellipse-based RLR method with the same efficiency in 
L[6) calls simply by running MCMC, without calls to L(-), to explore 'k{9)I{9 G 
£'ZZ[£'t(fc_i)]) at each k > 1 instead of the rejection sampling used in the present 
example. Indeed, the irony is that this may be in fact the only truly interesting 
case for RLR in such ellipse-based nested schemes since the normalization of every 
such gj(fc) [9) is otherwise trivially computable as the volume of the corresponding 
ellipsoid, Vg. These directly-computed .^t(fc) [= Vs\ can of course be employed 
directly to estimate Z as in the final importance sampling step for the RLR (cf. 
Section 1.2), treating the pooled 9i as if from the pseudo- mixture density, 



(35) 



m 

P{0) = Y.[n{s)/nm9 G Ell[E^,,^^,)]) /Vs]. 

s=l 
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For the N = 142 (n ~ 1000) experiment considered above this pseudo-mixture 
density scheme yields a surprisingly accurate mean and standard error of log Z = 
— 4.155ib0.015 (compared to — 4.15ib0.09 for the RLR). Further computational ex- 
periments are now underway to better quantify the advantages offered by this ap- 
proach to harnessing the information content of these otherwise discarded draws 
in the ellipse-based nested sampling paradigm (Feroz et al., in prep.). 

4. CASE STUDY: NORMAL MIXTURE MODELING OF THE GALAXY 

DATASET 

The well-known galaxy dataset, first proposed as a test case for kernel density 
estimation by Roeder (1990), consists of precise recession velocity measurements 
(in units of 1000 km s~^) for 82 galaxies in the Corona Borealis region of the 
Northern sky reported (as "the unfilled sample") by Postman et al. (1986). The 
purpose of the original astronomical study was to search — in light of a then 
recently discovered void in the neighboring Bootes field (Kirshner et al, 1981) — 
for further large-scale inhomogeneities in the distribution of galaxies, as evidence 
for the operation of hierarchical clustering processes on cosmological scales (Gunn 
& Gott, 1972). Given the well-defined selection function of their survey Postman 
et al. (1986) were easily able to compute as benchmark the recession velocity 
density function expected under the null hypothesis of a uniform distribution 
of galaxies throughout space; and by visual comparison of this density against 
a histogram of their observed velocities the astronomers were able to establish 
strong evidence against the null, as was their aim. 

However, as Roeder (1990) soon realized, under this favored hypothesis of an 
inhomogeneous galaxy distribution one can pose the far more challenging statis- 
tical question of "how many distinct clustering components are in fact present in 
the recession velocity dataset" ? Many authors have since attempted to answer this 
question as a univariate. Normal mixture modeling problem, with notable con- 
tributions in the Bayesian framework including those of Escobar & West (1995), 
Phihips & Smith (1996), Richardson & Green (1997) and Stephens (2000). The 
pre-Millennial contributions to this end being well summarized by Aitkin (2001), 
who highlights the extreme sensitivity to the specified priors of the inferred num- 
ber of components at the posterior mode. For this reason the galaxy dataset 
provides a most interesting test case for the utility of the recursive pathway with 
prior-sensitivity analysis as described in Section 1.4. 

In the following we detail the Normal mixture model adopted (Section 4.1), 
discuss various astronomical motivations for our priors (Section 4.2), run RLR on 
this problem with MC^ exploration of the posterior (Section 4.3), and explore the 
prior-sensitivity of our results with a comparison to previous analyses (Section 
4.4). 

4.1 Normal Mixture Model 

Following Richardson &: Green (1997) we write the /c-component Normal mix- 
ture model with component weights, w, in the latent allocation variable form for 
data vector, y, and (unobserved) allocation vector, z, such that 

(36) 7r(zj = j) = Wj and TT{yi\zi = j) = fAfiVilOj)- 

Here fj\f{-\Oj) represents the one-dimensional Normal density, which we will refer- 
ence in mean-precision syntax as Af{fij,T~^), i.e., 9j = {fj,j,Tj}'. The likelihood 
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function for independent m is thus recoverable via summation over the unob- 
served, 

n k 

(37) L{y\e, w,k) = WY. ^lf^fiy^\^j)■ 

i=l j=l 

Given priors for the number of components in the mixture, the weights at a 
given k, and the vector of mean-precisions — i.e., vr(A;), '7r{w\k), and Tr{9\w,k), 
respectively — the corresponding posterior in {k,w,9} becomes 

(38) 7r{k, w, 0\y) oc 'K{k)'K{w\k)TT{e\w, k)L{y\9, w, k), 

which we can integrate over {w,9} to yield, 7r(A;|y). To this end we suppose 
a strictly finite mixture and run MC'^+RLR over a simple fixed grid of A; G 
{3, . . . , 10}, estimating the series of (/^-conditional) marginal likelihoods, 

(39) Zk = j TT{w\k)TT{9\w,k)L{y\9,w,k)dwd9, 

to which we then apply 7r(/c) and normalize for iT{k\9); the computational details 
of this procedure are given in Section 4.3 below. 

We now discuss a number of astronomical considerations relevant to our choice 
of priors for this particular modeling problem. 

4.2 Astronomical Motivations for our Priors 

As noted earlier, by considering the well-defined selection function of their ob- 
servational campaign the authors of the original astronomical study were able to 
construct the expected probability density function of recession velocities for their 
survey under the null hypothesis (of a uniform distribution of galaxies throughout 
space). In particular, Postman et al. (1986) recognised that the strict apparent 
magnitude limit of their spectroscopic targetting strategy (m^ < 15.7 mag) would 
act as a luminosity (or absolute magnitude) limit evolving with distance according 
to 

(40) Mr,nmiv) logio(i') - 30, 

where we have assumed v in units of 1000 km s~^ and a "Hubble constant" 
of Hq = 100 km Mpc~^. To estimate the form of the resulting selection 
function, Sma.giv), Postman et al. (1986) considered how the relative number of 
galaxies per unit volume above (i.e., brighter than) this limit would vary with 
distance given the absolute magnitude distribution function, -Fmag(')' fo^' galaxies 
in the local Universe, i.e., 5'mag(^^) oc 1 — -Fmag(-^r,iim(f^))- To approximate the 
latter the astronomers simply integrated over a previous estimate of the local 
luminosity density parameterized as a Schechter function (Schechter, 1976) with 
characteristic magnitude, M* —19.40—1.5 mag, and faint-end slope, a* ^ —1.3, 
such that 

(41) /(M) oc [io2/5(M*-Af)j<+i exp[-102/5(^^;-^^)], 
and 

f{M)dM. 

-oo 
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An interesting feature of magnitude- limited astronomical surveys is that, al- 
though with increasing recession velocity this 5'mag(w) selection function restricts 
their sampling to the decreasing fraction of galaxies above Mrjini(i'), the vol- 
ume of the Universe probed by (the projection into three-dimensional space of) 
their two-dimensional angular viewing window is, in contrast, rapidly increasing. 
Hence, there exists an important additional selection effect, 5'voi(f), operating in 
competition with, and initially dominating, that on magnitude, and scaling with 
(roughly) the third power of distance, or recession velocity, such that 

(43) S^oiiv) oc v\ 

The product of these two effects therefore returns the net selection function 
of the galaxy dataset, which we illustrate (along with each effect in isolation) 
in Figure 4 (see also Figure 4b from Postman et al. 1986). Important to note is 
that the net selection function is distinctly non-uniform; rather than lying flat 
across the range of f , it actually rises to a maximum of almost twice its initial 
power at u ~ 17, beyond which it declines slowly again towards zero at v ^ 40. 
As a result it is difficult to justify a strictly symmetric prior on the component 
weights, w, as has been popular in past analyses (e.g. Roeder &: Wasserman 1997; 
Richardson &: Green 1997). We therefore favor instead an asymmetric Dirichlet 
prior, TT{w\k) ~ ^(<5), with weights, 6 = {Q)"' , j = l,---,^}'; and controlling 
hyperparameter, 7 > 0, on the mean-ordered components (i.e., /ii < . . . < [ik]- 
With the resulting Dirichlet density quite sensitive to the specified 7 we have 
elected to treat this hyperparameter conservatively, specifying a nominal value 
of 0.2, but monitoring carefully the prior-sensitivity of 7r{9\y) to this assumption 
in Section 4.4. 

The selection function of the original survey can also meaningfully inform our 
prior on the distribution of component means (the € 9j). Here we adopt a 
common Normal prior across all given in mean-precision form as 7r{fJ.j) ~ 
AA(k, ^~^); with the hyperparameters, {k = 17, ^ = 0.015}, chosen to give a rea- 
sonable match to the shape of S^agiv) S^oiiv) . Interestingly, this k is quite close 
to that of 20 chosen (a posteriori]) by Richardson & Green (1997) and others, 
though its corresponding ^ is significantly more informative than their choice of 
0.0016 (which is so broad as to place ~20% of the prior mass on components with 
negative mean recession velocities) . For the precision of each mixture component 
(the Tj £ 9j) we adopt a common Gamma prior, vr(Tj) ~ r(a,/3); the hyperpa- 
rameters of which are perhaps not so well constrained on astronomical grounds — 
although we can at least be confident that any large-scale clustering should occur 
above the scale of individual galaxy clusters (~1 Mpc, or Av ~ 0.1) and (unless 
the uniform space-filling hypothesis were correct) well below the width of our 
selection function. Hence, we simply adopt a fixed shape (hyper-) parameter of 
a = 2 and allow the rate (hyper-)parameter to vary as a hyperprior with density, 
7r{f3) ~ r(l,0.05). Our choice here is again comparable to that of Richardson &: 
Green (1997) who suppose 7r(/3) ~ r(0. 2, 0.016)— not r(0.2,0.573) as mis-quoted 
by Aitkin (2001) — though we evidently place far less prior weight on exceedingly 
large precisions (small variances). 

Finally, to construct a prior for the number of components in the mixture we 
refer to yet another aspect of the Postman et al. (1986) survey design; namely, 
the deliberate placement of the five individual windows of the "unfilled" sample 
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Fig 4. Visualization of the galaxy dataset, including its Ahell clusters and selection function. 
The clustering of galaxies in (two-dimensional) recession velocity-declination space is illustrated 
by way of the "cone diagram" shown in the top panel, and its (one-dimensional) projection to 
a recession velocity histogram shown in the bottom panel. In both cases the "unfilled" sample 
of the nominal galaxy dataset and the combined "filled"-!- "unfilled" sample of Postman et al. 
(1986) are indicated (with black and grey markings, respectively) for reference. The positions of 
five Abell clusters targetted by the "unfilled" survey are highlighted here in blue (circles/ arrows) , 
while the corresponding magnitude-dependent, volume- dependent, and net selection functions are 
overplotted (bottom panel) in red (the dotted, dash-dotted, and solid curves, respectively) . 

over five previously-identified galaxy clusters from the Abell catalog (cf. their 
Figure 3), which anchors the mode of our Poisson 7r(A;), i.e., V{\ = 5). With the 
k = 1 and k = 2 mixture models already well excluded by previous analyses, 
and > 10 a pragmatic upper bound for exploration given A = 5, we explicitly 
truncate our prior to the domain, 3 < < 10. This contrasts somewhat with the 
Uniform priors on A; < 10 and k < 30 assumed by Roeder & Wasserman (1997) 
and Richardson & Green (1997), respectively — though reweighting for alternative 
Tr{k) is trivial even without RLR. 

4.3 Application of RLR with MC^ Sampling 

As Lee et al. (2008) highlight in their review of contemporary methods for 
inference on Bayesian mixture models, whether for Gibbs sampling (e.g. Diebolt 
& Robert 1994), random walk MCMC, or otherwise, efficient exploration of the 
mixture model posterior at fixed k can be difficult to achieve due to the intrinsic 
non-identifiability of the component means (not present here with our asymmetric 
Tr{'Wi)) and the imposition of a Dirichlet prior structure on the component weights 
{viz. the simplex constraint, Yl'j=i''^j — ^ with < Wj < 1). Here we have 
opted for an MC'^ strategy to accomplish this task, taking advantage of the RLR 
requirement for exploring multiple bridging densities to hasten the convergence of 
our tempered-likelihood MCMG chains. Rather than update via Gibbs sampling. 
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which can become 'stuck' on particular permutations of the aUocation vector, z, 
we foUow a simple random-walk Metropolis-Hastings prescription. 

For the present analysis (conducted on a modest laptop computer) we run 
our MC^ sampler for each k with twenty temperatures spaced uniformly as t = 
{(0, 1/19, . . . , 19/19)} (giving a reasonably equal spacing of the log Z^f^y-^ for this 
problem), a burn-in phase of 10,000 draws, and a final output of 100,000 draws 
from each t{k'). We then thin each output chain to 5000 draws (for computational 
speed) and run RLR with Kong et al. (2003) 's form for the asymptotic covariance 
matrix to estimate our uncertainties on the derived logZ^. The results of this 
analysis are illustrated in Figure 5. Under the astronomically- motivated priors 
stipulated in Section 4.2 our marginal likelihood computations clearly favor a 
six-to-eight component mixture; the form of which we illustrate for the k = 7 
case (in comparison to a histogram of the galaxy dataset) in the right-hand panel 
of Figure 5. A quick inspection of the latter, however, suggests that three of the 
seven mixture components are being used here to model what might reasonably be 
considered a single, or perhaps double, component of coherent velocity structure. 
Furthermore, had we adopted a flat 7r{k) instead of a A = 5 (truncated) Poisson 
the resulting posterior would have peaked towards even higher k (eight-to-nine) , 
suggesting an even greater degree of 'over- fitting'. 



Galaxy Dataset ^ Galaxy Dataset 

w/ Astro. Motiv. Priors w/ Astro. Motiv. Priors 
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Fig 5. Posterior probabilities for the number of Normal mixture components in the galaxy 
dataset, n{k\y), under our astronomically-mottvated priors (left-hand panel). The solid, blue 
symbols here denote the true posterior, while the open, red symbols indicate for reference the raw 
marginal likelihood-based result, i.e., before application of our Poisson 7r(fc). In each case the 
relevant uncertainties (recovered from estimates of the asymptotic covariance matrix for each 
logZk) are illustrated as la error bars. The inferred probability density (in velocity space) at 
the maximum a posteriori parameterization of our Normal mixture model for k = 7 is then 
illustrated for reference against a scaled histogram of the galaxy dataset m the right-hand panel. 

Given the lack of any strong physical motivation for the assumed Normal 
form, a significant non-Normality of the underlying components could well be 
a good explanation for this result; and indeed such a hypothesis might readily 
be investigated by extending the above model comparison across mixtures of 
Student's t or skew Normal distributions as well. However, within the Normal 
mixture model framework to which we confine ourselves for the present example 
another explanation could lie in the sensitivity of the posterior to the chosen 
priors, and so we investigate this with our RLR output in Section 4.4 below. 
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4.4 Exploration of Prior-Sensitivity 

As noted in Section 1.4, by estimating normalization constants not just for the 
posterior itself but also for the specified sequence of bridging densities linked with 
the prior (or suitable auxiliary) the RLR framework readily faciliates the analy- 
sis of prior-sensitivity via importance-sampling-based reweighting of the output 
draws (cf. Equation 29). We therefore begin here by confirming the flexibility of 
this reweighting procedure, using our RLR output to recover vr(A;|y) under the 
priors used by Richardson & Green (1997) while keeping track of our uncertain- 
ties via the effective sample size (as per Kong et al. 2003); the success of which 
is illustrated in the left-hand panel of Figure 6. At face value the greatest dif- 
ference between our prior construction and that of Richardson & Green (1997) 
might be considered the asymmetry and symmetry, respectively, of the Dirichlet 
distributions chosen for our component weights. As explained in Section 4.2 the 
asymmetry of our 7r(w) is controlled by the parameter, 7, which we set as default 
to a value of 0.2. To investigate the sensitivity of 7r{k\y) to this parameter we 
therefore reweight with two alternatives, 7 = (symmetric) and 7 = 0.4 (more 
asymmetric), keeping all other hyperparameters fixed; the results of which are 
shown in the right-hand panel of Figure 6. Interestingly, we find that our default 
choice of 7 = 0.2 has had only a minor role in shaping the posterior, relative to 
the symmetric alternative, although a 7 as high as 0.4 does indeed force TT{k\y) 
to even further 'over-fitting'. 
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Fig 6. Demonstration of the potential for reweighting of RLR output to investigate the sensitivity 
of Bayesian model selection to the specified priors. In the left-hand panel we reweight our draws 
from the galaxy dataset posterior to match the priors used by Richardson & Green (1997), keeping 
track of our uncertainties via the effective sample size. In the right-hand panel we investigate the 
effect of modifying the parameter, 7, controlling the asymmetry of our Dirichlet prior, revealing 
that this factor alone is insufficient to account for the marked differences between our 7r(fc|y) 
and that of Richardson & Green (1997). 

Continuing with our prior-sensitivity analysis we have recomputed 'K{k\y) un- 
der two and four-fold increase and decreases in each of our remaining hyper- 
parameters. Of these the most marked sensitivity was observed for the precision, 
^, of our Normal prior on the positions^ ^ij, of the component means; recalling 
that ^ = 0.015 was chosen to match the shape of the selection function (in ve- 
locity space) of the original astronomical survey. As we illustrate in the left-hand 
panel of Figure 7 a factor of four decrease to ^ = 0.00375 is in fact sufficient 
to account for much of the difference between our nominal posterior and that of 
Richardson & Green (1997), shifting our vr(A;|y) towards significantly fewer mix- 
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ture components. Moreover, as we show in the right-hand panel of this Figure 
the k = 6 model favored under this modified prior appears visually to give a far 
more satisfactory fit to the binned data. This is of course a little ironic as such a 
broad prior on the component means is much less appealing from an astronomical 
perspective (in that it places significant probability on detecting components far 
outside the bounds of the true selection function). Nevertheless, we hope that 
the analysis presented here has given a clear demonstration of the potential of 
RLR-based marginal likelihood computation for exploring prior-sensitivity in the 
Bayesian model selection context. 
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Fig 7. Continuing demonstration of the potential for reweighttng of RLR output to investigate 
the sensitivity of Bayesian model selection to the specified priors. In the left-hand panel we 
investigate the effect of modifying the parameter, ^, controlling the precision of our Normal 
prior on the position of the component means. Somewhat surprisingly (see discussion) we find 
n{k\y) for the galaxy dataset markedly responsive to modest changes in this particular hyper- 
parameter; with a factor of four decrease in our nominal ^ — 0.015 alone almost sufficient to 
match the Richardson & Green (1997) result. The inferred probabtlity density (in velocity space) 
at the maximum a posteriori parameterization of our Normal mixture model under this rescaled 
hyper-parameter for k = 6 is illustrated for reference against a scaled histogram of the galaxy 
dataset in the right-hand panel. 



5. CONCLUSIONS 

In this paper we have explored both the theoretical foundations of and connec- 
tions between those recursive pathways to marginal likelihood estimation char- 
acterized by reverse logistic regression and the density of states, and we have 
introduced the novel heuristic of 'thermodynamic integration via importance sam- 
pling' for better understanding the role of the bridging sequence in this process. 
Furthermore, by way of our numerical examples we have highlighted a number 
of considerations for maximizing the efficiency of RLR-type schemes (tailoring of 
the bridging sequence to achieve an equal spacing of log ; use of a data-driven 
reference/ auxiliary; use of all draws in nested ellipse-based sampling), and, im- 
portantly, their utility for prior-sensitivity analysis. Though a 'one-size-fits-all' 
algorithm to solve the challenging problem of marginal likelihood estimation re- 
mains elusive we hope that our contribution herein leads to both greater use of 
the recursive pathway itself and greater interest in estimators that facilitate rapid 
recomputation under alternative priors in general. 
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